The choice of whether to use numba's parallel CPU target for a given algorithm depends on a number of factors.
This notebook illustrates how the 'size' and 'shape' of a dataset may be one such factor, using a simple algorithm equivalent to the default behaviour of sklearn.preprocessing.StandardScaler.
More details on numba's parallelisation features are given in the excellent numba docs.
In [1]:
import numpy as np
import pandas as pd
from numba import njit, prange
from pytest import approx
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import matplotlib
matplotlib.rc('figure', figsize=(10, 5))
Note regarding the following algorithms: the only difference between the two implementations is the `parallel` flag passed to `njit` and the use of `prange` rather than `range` for the column loop.
In [2]:
@njit(parallel=False)
def standard(A):
    """
    Standardise data by removing the mean and scaling to unit variance,
    equivalent to sklearn StandardScaler.
    """
    n = A.shape[1]
    res = np.empty_like(A, dtype=np.float64)
    for i in range(n):
        data_i = A[:, i]
        res[:, i] = (data_i - np.mean(data_i)) / np.std(data_i)
    return res
In [3]:
@njit(parallel=True)
def standard_parallel(A):
    """
    Standardise data by removing the mean and scaling to unit variance,
    equivalent to sklearn StandardScaler.
    Uses an explicit parallel loop; may offer improved performance in
    some cases.
    """
    n = A.shape[1]
    res = np.empty_like(A, dtype=np.float64)
    for i in prange(n):
        data_i = A[:, i]
        res[:, i] = (data_i - np.mean(data_i)) / np.std(data_i)
    return res
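The per-column transform used above can be sanity-checked in plain NumPy, with no numba required. `standard_numpy` below is a hypothetical helper written for illustration, not part of the notebook:

```python
import numpy as np

def standard_numpy(A):
    # Column-wise standardisation: subtract the mean and divide by the
    # population standard deviation (ddof=0), matching StandardScaler's
    # defaults.
    return (A - A.mean(axis=0)) / A.std(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 4))
Z = standard_numpy(X)
# each column of Z now has mean ~0 and standard deviation ~1
```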
We're going to use the Iris dataset (150 rows x 4 columns) for this test.
In [4]:
A = load_iris().data
In [5]:
expected = StandardScaler().fit_transform(A)
In [6]:
output = standard(A)
In [7]:
np.allclose(output, expected)
Out[7]:
In [8]:
output_parallel = standard_parallel(A)
In [9]:
np.allclose(output_parallel, expected)
Out[9]:
In [10]:
def highlight_min(s):
    is_min = s == s.min()
    return ['background-color: yellow' if v else '' for v in is_min]
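As a quick illustration on hypothetical data (not from the notebook), `highlight_min` returns a CSS style string only at the position of the minimum value in a row:

```python
import pandas as pd

def highlight_min(s):
    # Return a CSS style for the minimum entry of the Series, and an
    # empty string for every other entry.
    is_min = s == s.min()
    return ['background-color: yellow' if v else '' for v in is_min]

row = pd.Series([3.2, 1.1, 2.5])
styles = highlight_min(row)
# styles -> ['', 'background-color: yellow', '']
```

Applied row-wise via `df.style.apply(highlight_min, axis=1)`, this highlights the fastest implementation for each problem size.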
Firstly, we're going to tile the data 'horizontally' so that we keep the same number of rows but add successively more columns.
In [11]:
res = []
multiples = range(1, 42, 5)
for idx, i in enumerate(multiples):
    data = np.tile(A, i)
    o_1 = %timeit -o -q StandardScaler().fit_transform(data)
    o_2 = %timeit -o -q standard(data)
    o_3 = %timeit -o -q standard_parallel(data)
    res.append((data.shape[1], o_1.best, o_2.best, o_3.best))
    print('{0} of {1} complete {2}'.format(idx + 1, len(multiples), data.shape))
In [12]:
df = pd.DataFrame(res, columns=['num_cols', 'sklearn', 'numba CPU', 'numba CPU parallel'])
In [13]:
df = df.set_index('num_cols')
df = df * 1000  # convert seconds to milliseconds
In [14]:
ax = df.plot()
ax.set_title('Standard scale data: 150 rows by n columns')
ax.set_xlabel('Number of columns')
ax.set_ylabel('Time (ms)')
plt.legend(prop={'size': 14})
Out[14]:
In [15]:
df.style.apply(highlight_min, axis=1)
Out[15]:
In the results above, observe the crossing point beyond which CPU parallel is (and remains) the fastest strategy.
Furthermore, observe its relative insensitivity to the number of columns: the prange loop parallelises over columns, so additional columns are distributed across threads rather than processed serially.
Next, we're going to repeat the experiment but this time tiling the data 'vertically' so that we add successively more rows.
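The two tiling strategies can be checked on a toy array; the shapes below use the Iris dimensions (150 x 4) purely for illustration:

```python
import numpy as np

A = np.zeros((150, 4))

# Horizontal tiling: same number of rows, i times as many columns.
horiz = np.tile(A, 3)

# Vertical tiling: transpose, tile, transpose back, giving i times
# as many rows with the column count unchanged.
vert = np.tile(A.T, 3).T

# horiz.shape -> (150, 12); vert.shape -> (450, 4)
```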
In [17]:
res = []
for idx, i in enumerate(multiples):
    data = np.tile(A.T, i).T
    o_1 = %timeit -o -q StandardScaler().fit_transform(data)
    o_2 = %timeit -o -q standard(data)
    o_3 = %timeit -o -q standard_parallel(data)
    res.append((data.shape[0], o_1.best, o_2.best, o_3.best))
    print('{0} of {1} complete {2}'.format(idx + 1, len(multiples), data.shape))
In [18]:
df = pd.DataFrame(res, columns=['num_rows', 'sklearn', 'numba CPU', 'numba CPU parallel'])
In [19]:
df = df.set_index('num_rows')
df = df * 1000  # convert seconds to milliseconds
In [20]:
ax = df.plot()
ax.set_title('Standard scale data: n rows by 4 columns')
ax.set_xlabel('Number of rows')
ax.set_ylabel('Time (ms)')
plt.legend(prop={'size': 14})
Out[20]:
In [21]:
df.style.apply(highlight_min, axis=1)
Out[21]:
In this case, observe that CPU parallel is almost never optimal, and that it is sensitive to the number of rows.
This makes sense: with only 4 columns, the prange loop has very little work to distribute across threads, while the per-column work grows with the row count, so the threading overhead is rarely repaid.